ABSTRACT

We consider proposals to improve the performance of the Tor overlay network by increasing the number of connections between relays, such as Torchestra and PCTCP. We introduce socket exhaustion, a new class of attacks that can apply to these designs, and show that these attacks are effective against PCTCP. We also describe IMUX, a design that generalizes the principles behind these designs while still mitigating socket exhaustion attacks. We demonstrate empirically that IMUX resists socket exhaustion, and we find that web clients can realize up to a 25% increase in performance compared to Torchestra. Finally, we empirically evaluate the interaction between these designs and the recently proposed KIST design, which aims to improve performance by intelligently scheduling kernel socket writes.

Categories and Subject Descriptors

C.2.0 [Computer-Communication Networks]: Security and Protection
1. INTRODUCTION

Tor is a widely used tool for low-latency anonymous and uncensored Internet access, consisting of an overlay network of roughly 5000 relay nodes and an estimated 1 million daily users. To provide anonymity, each Tor client periodically chooses three relays and iteratively forms a circuit by telescopically contacting each relay through the previously contacted relays, so that each relay knows only two hops in any connection. Because of the high ratio of clients to relays, and because attracting performance-sensitive users can improve the privacy of all users, it is crucial to allocate the limited bandwidth and processing resources of the relays in a decentralized and privacy-preserving manner.

One well-recognized performance issue in Tor stems from the fact that all circuits passing between a pair of relays are multiplexed over a single TLS connection. As shown by Reardon and Goldberg [20], this can have several undesirable effects on performance: a single, high-volume circuit can lead to link congestion, throttling all circuits sharing this link [20]; delays for packet retransmissions can increase latency for other circuits, leading to “head-of-line” blocking [19]; and long write buffers reduce the effectiveness of application-level scheduling decisions [10].

As a result, several researchers have proposed changes to the transport protocol for the links between relays. Reardon and Goldberg suggested that relays should use a Datagram TLS tunnel at the transport level, while running a separate TCP session at the application level for each circuit [20]; this adds a high degree of complexity (an entire TCP implementation) to the application. Similarly, the “Per-Circuit TCP” (PCTCP) design [5] establishes a TCP session for each circuit, hiding the exact traffic volumes of these sessions by establishing an IPSEC tunnel between each pair of relays; however, kernel-level TCP sessions are an exhaustible resource, and we demonstrate in Section 3 that this can lead to attacks on both availability and anonymity. In contrast, the Torchestra transport suggested by Gopal and Heninger [8] has each relay pair share one TLS session for “bulk download” circuits and another for “interactive traffic.” Performance then critically depends on the threshold for deciding whether a given circuit is bulk or interactive.

This paper presents a third potential solution, inverse-multiplexed Tor with adaptive channel size (IMUX). In IMUX, each relay pair maintains a set of TLS connections (a channel) whose size is roughly proportional to the number of “active” circuits between the pair, and all circuits share these TLS connections; the total number of connections per relay is capped. As new circuits are created or old circuits are destroyed, connections are reallocated between channels. This approach allows relays to avoid many of the performance issues associated with the use of a single TCP session: packet losses and buffering on a single connection do not cause delays or blocking on the other connections associated with a channel. At the same time, IMUX can offer performance benefits over Torchestra by avoiding fate sharing among all interactive streams, and over per-circuit designs by avoiding the need for TCP handshaking and slow start on new circuits. Compared to designs that require a user-space TCP implementation, IMUX has significantly lower implementation complexity, and because of its per-relay connection cap, IMUX can mitigate attacks aimed at exhausting the available TCP sessions at a target relay.
Contributions. We make the following contributions to the Tor relay transport literature:

• We describe new socket exhaustion attacks on Tor and PCTCP that can anonymously disable targeted relays, and demonstrate how socket exhaustion leads to reductions in availability, anonymity, and stability.

• We describe IMUX, a novel approach to the circuit-to-socket solution space. Our approach naturally generalizes between the “per-circuit” approaches such as PCTCP and the fixed number of sessions in “vanilla Tor” (one) and Torchestra (two).

• We analyze a variety of scheduling designs for using a variable number of connections per channel through large-scale simulations with the Shadow simulator [11]. We compare IMUX to PCTCP and Torchestra, and suggest parameters for IMUX that empirically outperform both related approaches while avoiding the need for IPSEC and reducing vulnerability to attacks based on TCP session exhaustion.

• We perform the first large-scale simulations of the Torchestra design and the first simulations that integrate KIST [10] with Torchestra, PCTCP, and IMUX to compare the performance interactions among these complementary designs.

2. BACKGROUND

In this section we first discuss the details of Tor’s internal architecture, focusing on how data is sent through the network while covering some of the specifics of intra-relay communication. In addition, we review related work on improving Tor’s performance through changes to the transport and scheduling mechanisms.

2.1 Tor’s Architecture

The Tor overlay network consists of over 5,000 volunteer relays and provides anonymity by routing data through the network to the end destination, preventing any single intermediary from learning the identity of both the source and the destination. Clients choose three relays (a guard, a middle, and an exit) and construct circuits through them. TCP streams can then be multiplexed over the circuit to the exit relay, which forwards the data on to the intended destination. Data transferred through the circuit is packaged into 512-byte cells and encrypted in a layered fashion, using symmetric keys shared with each of the three relays.

To create a circuit, the client sends a series of EXTEND cells through the circuit, each of which instructs the current last hop to extend the circuit to another relay. For example, the client sends an EXTEND cell to the guard telling it to extend to the middle. Afterwards, the client sends another EXTEND cell to the middle telling it to extend the circuit to the exit. On receiving an EXTEND cell, a relay establishes a channel to the next relay if one does not already exist. Cells from all circuits between the two relays are transferred over this channel, which is responsible for in-order delivery and, ideally, for protecting communication from potential eavesdroppers. Tor implements this channel as a single TLS connection between the relays: the underlying TCP connection provides in-order delivery, and TLS encrypts and authenticates all traffic.

Figure 1 shows the internal cell-processing architecture of a Tor relay. Data is read from the channel and stored in an internal input buffer. Once enough data has been read (e.g., a full TLS record), the data is decrypted and the cells are sent to their respective circuit queues. When a channel is able to write, it uses a prioritized circuit scheduler to select from all circuits traveling through the channel. The selected circuit’s queue is then flushed to the output buffer on the channel, and once the output buffer holds enough data, it is sent out over the channel to the next relay.
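To make this data path concrete, here is a toy Python model of one outgoing channel. It assumes cells arrive already decrypted and tagged with a circuit id; `Channel`, `enqueue`, and `on_writable` are hypothetical names rather than Tor's C API, and the `pick_circuit` callback stands in for the prioritized circuit scheduler.

```python
from collections import deque


class Channel:
    """Toy model of one outgoing channel (hypothetical, not Tor's API):
    per-circuit queues drained into a single output buffer."""

    def __init__(self):
        self.circuit_queues = {}          # circuit id -> deque of cells
        self.output_buffer = bytearray()  # bytes waiting to go to the socket

    def enqueue(self, circ_id, cell):
        # A demultiplexed cell from the input side lands in its circuit's queue.
        self.circuit_queues.setdefault(circ_id, deque()).append(cell)

    def on_writable(self, pick_circuit):
        # When the channel can write, let the scheduler choose one circuit
        # among those with queued cells, then flush that circuit's queue.
        ready = [c for c, q in self.circuit_queues.items() if q]
        if not ready:
            return None
        circ = pick_circuit(ready)
        queue = self.circuit_queues[circ]
        while queue:
            self.output_buffer += queue.popleft()
        return circ
```

Passing `min` as the scheduler, for example, serves the lowest circuit id first; a real scheduler would rank circuits by priority instead.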


Figure 1: Internal architecture of cell processing inside a Tor relay. All cells from a given relay arrive on a single channel and are then demultiplexed to circuit queues for processing. Each circuit queue is emptied into the channel associated with the next-hop relay.

2.2 Related Work

The Tor network [7] has been very successful in attracting users, servicing millions of clients daily. Since the network consists of volunteer relays with a very large user base, network resources such as bandwidth and processing are scarce. To handle the increasing usage, a large body of research has examined potential ways to lower latency and increase performance for end users. These proposals range from dealing with the limitations of TCP [20, 18, 17], better utilization of network resources [4, 5], prioritizing web traffic over bulk data [3, 21, 14], and improving ways of dealing with congestion [22, 4, 2], to adding incentives to act as a relay [12, 13]. We briefly cover the details of the research most relevant to the work in this paper.

TCP-over-DTLS: Reardon and Goldberg [20] were the first to measure and document several potential performance limitations due to the use of a single TCP connection for each channel, including head-of-line blocking, fate sharing, and buffering delay. To circumvent these problems, they proposed using a Datagram TLS (DTLS) transport channel between relays through which TCP connections are tunneled. In addition, each circuit gets its own (user-space) TCP connection instead of forcing all circuits to share a single (kernel) TCP connection. Reardon and Goldberg used microbenchmarks to show that this design could significantly reduce the impact of these problems.

uTLS/uTCP: To address head-of-line blocking while avoiding the problems associated with implementing a user-space TCP stack, Nowlan et al. proposed replacing TLS channels with uTLS/uTCP [18, 19], a variant of the kernel implementation of TCP that allows Tor to peek into the input buffer for full TLS records, decrypt them, and process them if the cell is the next one for some associated circuit. They evaluated this implementation using a “circuit-dumbbell” topology, showing that head-of-line blocking was significantly reduced.

EWMA: Tang and Goldberg developed a prioritized circuit scheduler [21] that computes an exponentially weighted moving average (EWMA) of how often cells are sent on each circuit, in order to prioritize interactive traffic over bulk traffic (e.g., BitTorrent). This is currently the scheduler used in the live Tor software. Jansen and Hopper [11] used the Shadow simulator to evaluate EWMA scheduling on a large-scale network and found that the benefits were highly dependent on the network load.
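The core of EWMA prioritization can be sketched in a few lines. This is a simplified illustration, not Tor's implementation: we assume a discrete "tick" clock and a half-life measured in ticks, both of which are our own simplifications. Each circuit's score decays geometrically and grows by one per cell sent, and the scheduler favors the circuit with the lowest score, i.e., the one that has been least active recently.

```python
class EwmaScheduler:
    """Simplified EWMA circuit prioritization (illustrative only)."""

    def __init__(self, halflife_ticks):
        # Per-tick multiplier chosen so that a score halves after halflife_ticks.
        self.decay = 0.5 ** (1.0 / halflife_ticks)
        self.score = {}  # circuit id -> decayed count of cells sent

    def tick(self):
        # Age every circuit's activity score by one time step.
        for circ in self.score:
            self.score[circ] *= self.decay

    def record_sent(self, circ, ncells=1):
        self.score[circ] = self.score.get(circ, 0.0) + ncells

    def pick(self, ready):
        # Quietest circuit first: interactive circuits send few cells,
        # so they keep low scores and win over bulk circuits.
        return min(ready, key=lambda c: self.score.get(c, 0.0))
```

The half-life controls how long a burst of bulk traffic keeps counting against a circuit; the Shadow results above suggest its benefit depends heavily on load.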

DiffTor: AlSabah, Bauer, and Goldberg [3] develop a novel approach that uses classifiers to more accurately distinguish between web, bulk, and streaming traffic. Using these highly accurate classifiers, quality of service (QoS) algorithms can give more favorable treatment to web traffic. Their evaluation throttled bulk traffic to improve the throughput of other traffic classes, as suggested by Jansen et al. [14].

Traffic Splitting: AlSabah et al. propose Conflux, a dynamic load-balancing algorithm [2] that can split traffic on one stream across multiple circuits in an attempt to avoid congestion in the Tor network. By embedding sequence numbers in the cells, the exit can reassemble the stream from the multiple circuits to provide in-order delivery to the end host.
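The reordering step at the exit can be illustrated with a small reassembly buffer. This is our own sketch, not Conflux's cell format or code: it assumes every cell carries a per-stream sequence number starting at 0 and holds early arrivals in a min-heap until the gap before them is filled.

```python
import heapq


class Reassembler:
    """Reorders cells arriving over several circuits by sequence number."""

    def __init__(self):
        self.next_seq = 0   # next sequence number owed to the end host
        self.pending = []   # min-heap of (seq, payload) received early

    def receive(self, seq, payload):
        """Accept one cell; return the payloads now deliverable in order."""
        if seq < self.next_seq:
            return []  # duplicate of something already delivered
        heapq.heappush(self.pending, (seq, payload))
        delivered = []
        while self.pending and self.pending[0][0] <= self.next_seq:
            seq0, data = heapq.heappop(self.pending)
            if seq0 == self.next_seq:
                delivered.append(data)
                self.next_seq += 1
            # seq0 < next_seq: a buffered duplicate; drop it silently
        return delivered
```

A cell that arrives ahead of its predecessor over a faster circuit simply waits in the heap, so the end host always sees the stream in order.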

Torchestra: To prevent bulk circuits from interfering with web circuits, Gopal and Heninger [8] developed a new channel, Torchestra, that creates two TLS connections between each pair of relays, one reserved for web circuits and the other for bulk circuits. This prevents head-of-line blocking caused by bulk traffic from interfering with web traffic. The paper evaluates the Torchestra channel in a “circuit-dumbbell” topology and shows that time to first byte and total download time for “interactive” streams decrease, while “heavy” streams do not see a significant change in performance.

PCTCP: Similar to TCP-over-DTLS, AlSabah and Goldberg [5] propose dedicating a separate TCP connection to each circuit and replacing the TLS session with an IPSEC tunnel that carries all the connections, so that an adversary cannot learn circuit-specific information by monitoring the different connections. This has the advantage of eliminating the reliance on user-space TCP stacks, leading to reduced implementation complexity and improved performance. However, as we show in the next section, the use of a kernel-provided socket for every circuit makes it possible to launch attacks that attempt to exhaust this resource at a targeted relay.

KIST: Jansen et al. [10] show that cells spend a large amount of time in the kernel output buffer, causing unneeded congestion and severely limiting the effect of prioritization in Tor. They introduce a new algorithm, KIST, with two main components: global scheduling across all writable circuits, which restores effective circuit prioritization, and an autotuning algorithm that dynamically determines how much data should be written to the kernel. This allows data to stay internal to Tor for longer, so that Tor can make smarter scheduling decisions than simply dumping everything it can into the kernel, which operates in a FIFO manner.
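As a rough illustration of how KIST's two components fit together, the sketch below caps each socket's writes at a per-socket byte limit and then serves circuits from all channels in a single global priority order. The limit formula (congestion window times MSS, minus bytes already queued in the kernel) is a simplifying assumption of ours rather than KIST's exact autotuning rule, and all names are hypothetical.

```python
CELL_SIZE = 512  # bytes per Tor cell


def socket_write_limit(cwnd_segments, mss, bytes_in_kernel):
    """Assumed limit: what the congestion window could carry, minus what
    is already sitting in the kernel send buffer (never negative)."""
    return max(0, cwnd_segments * mss - bytes_in_kernel)


def global_schedule(circuits, limits):
    """circuits: iterable of (priority, circ_id, channel_id, queued_cells);
    lower priority values are served first, across *all* channels.
    limits: channel_id -> bytes the kernel should accept right now.
    Returns circ_id -> number of cells written this round."""
    remaining = dict(limits)
    written = {}
    for _prio, circ, chan, queued in sorted(circuits, key=lambda c: c[0]):
        # Write only whole cells that still fit under this socket's limit.
        n = min(queued, remaining.get(chan, 0) // CELL_SIZE)
        if n > 0:
            written[circ] = n
            remaining[chan] -= n * CELL_SIZE
    return written
```

Because the bulk circuit only gets whatever space the web circuit leaves under the limit, the excess stays in Tor's own queues, where the next scheduling round can still reorder it.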

3. SOCKET EXHAUSTION ATTACKS

This section discusses the extent to which Tor is vulnerable to socket descriptor exhaustion attacks that may reduce relay availability and client anonymity, explains how PCTCP creates a new attack surface with respect to socket exhaustion, and demonstrates how socket resource usage harms relay stability. The attacks in this section motivate the need for intelligent management of sockets in Tor, which is the focus of Sections 4 and 5.

3.1 Sockets in Tor

On modern operating systems, file descriptors are a scarce resource that the kernel must manage and allocate diligently. On Linux, for example, soft and hard limits restrict the number of file descriptors that any process may have open at one time. Once a process reaches this limit, any system call that attempts to open a new file descriptor will fail, and the kernel will return an EMFILE error code indicating too many open files. Since sockets are a specific type of file descriptor, the same issue can arise if a process opens sockets in excess of the file limit.

Aware of this limitation, Tor internally enforces its own connection limit. For relays running on Linux and BSD, an internal variable ConnLimit is set to the maximum limit as returned by the getrlimit() system call; ConnLimit is set to a hard-coded value of 15,000 on all other operating systems. Each time a socket is opened or closed, an internal counter is incremented or decremented; if a tor_connect() function call is made when this counter is above ConnLimit, it preemptively returns an error rather than waiting for one from the connect system call.
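The bookkeeping described above can be sketched in Python as follows. `ConnCounter`, `initial_conn_limit`, and the choice of raising `EMFILE` are hypothetical illustrations rather than Tor's C code, and because the text only says "the maximum limit," we assume the soft `RLIMIT_NOFILE` value is the one used. Python's standard `resource.getrlimit` wraps the same system call.

```python
import errno
import sys

FALLBACK_CONN_LIMIT = 15000  # hard-coded value described above


def initial_conn_limit():
    """Linux/BSD: take the limit from getrlimit(); elsewhere use 15,000.
    Assumption: we read the soft RLIMIT_NOFILE value."""
    if sys.platform.startswith(("linux", "freebsd", "openbsd", "netbsd")):
        import resource  # Unix-only standard-library module
        soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
        return soft
    return FALLBACK_CONN_LIMIT


class ConnCounter:
    """Tracks open sockets and refuses new ones before the kernel would."""

    def __init__(self, conn_limit):
        self.conn_limit = conn_limit
        self.n_open = 0

    def try_open(self):
        # Fail early, without issuing connect(), once the counter hits the limit.
        if self.n_open >= self.conn_limit:
            raise OSError(errno.EMFILE, "ConnLimit reached")
        self.n_open += 1

    def close(self):
        self.n_open -= 1
```

The important property for the attacks that follow is that this counter is shared by every connection the relay makes, so whoever fills it denies service to everyone else.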


3.2 Attack Strategies

There are several cases to consider when exploiting open sockets as a relay attack vector. Relay operators may be: (i) running Linux with the default maximum descriptor limit of 4096; (ii) running Linux with a custom descriptor limit, or running a non-Linux OS with the hard-coded ConnLimit of 15,000; or (iii) running any OS and allowing unlimited descriptors. We note that setting a custom limit generally requires root privileges, although it does not require that Tor itself be run as the root user. Also note that each Tor relay connects to every other relay with which it communicates, leading to potentially thousands of open sockets under normal operation. In every case, the adversary’s primary goal is to cause a victim relay to open as many sockets as possible.

Consuming Sockets at Exit Relays: To consume sockets at an exit relay, an adversary can create multiple circuits through independent paths and request TCP streams to various destinations. Ideally, the adversary would select services that use persistent connections to ensure that the exit holds the sockets open. The adversary could then send the minimal amount of traffic required to keep the connections active. Although the adversary remains anonymous (because the victim exit relay does not learn the adversary’s identity), keeping persistent connections active so that they are not closed by the exit comes at a bandwidth cost.

Consuming Sockets at Any Relay: Bandwidth may be traded for CPU and memory by using Tor itself to create the persistent connections, in which case relays in any position may be targeted. This could be achieved by an adversary connecting several Tor client instances directly to a victim relay; each such connection would consume a socket descriptor. However, the victim would then be able to determine the adversary’s IP address (i.e., identity). The attack can also be done anonymously. The basic mechanism for doing so was outlined in The Sniper Attack, Section II-C-3 [15], where it was used in a relay memory-consumption denial-of-service attack. Here, we use similar techniques to anonymously consume socket descriptors at the victim.

The attack is depicted in Figure 2a. First, the adversary launches several Tor client sybils. A1 and A5 are used to build independent circuits through G1, M1, E1 and G2, M2, E2, respectively, following normal path selection policies. These sybil clients also configure a SocksPort to allow connections from other applications. Then, A2, A3, and A4 use either the Socks4Proxy or the Socks5Proxy option to extend new circuits to a victim V through the Tor circuit built by A1. The A6, A7, and A8 sybils similarly extend circuits to V through the circuit built by A5. Each Tor sybil client will create a new tunneled channel to V, causing the exits E1 and E2 to establish new TCP connections with V.

Each new TCP connection to V will consume a socket descriptor at the victim relay. For the Socks4Proxy and Socks5Proxy options, the Tor software manual states that “Tor will make all OR connections through the SOCKS [4,5] proxy at host:port (or host:1080 if port is not specified).” We also successfully verified this behavior using Shadow. This attack allows an adversary to consume roughly one socket for every sybil client while remaining anonymous from the perspective of the victim. Further, the exits E1 and E2 will be blamed if any misbehavior is suspected, and they themselves will be unable to discover the identity of the true attacker. If Tor were to use a new socket for every circuit, as suggested by PCTCP [5], then the adversary could effectively launch a similar attack with only a single Tor client.

Figure 2: (a) Anonymous socket exhaustion attack using client sybils; (b) throughput of a relay when launching a socket exhaustion attack via circuit creation, with the shaded region representing when the attack was being launched; (c) memory consumption as a process opens sockets using libevent.

Consuming Sockets with PCTCP: PCTCP may potentially offer performance gains by dedicating a separate TCP connection to every circuit. However, PCTCP widens the attack surface and reduces the cost of the anonymous socket exhaustion attack discussed above. If all relays use PCTCP, an adversary may simply send EXTEND cells to a victim relay through any other relay in the network, causing a new circuit to be built and therefore a socket descriptor to be opened at the victim. Since the cells are forwarded from other relays in the network, the victim relay will not be able to determine who is originating the attack. Further, the adversary gets long-term persistent connections cheaply by using the Tor config options MaxClientCircuitsPending and CircuitIdleTimeout. The complexity of the socket exhaustion attack is thus reduced, and the adversary no longer needs to launch the tunneled sybil attack in order to anonymously consume the victim’s sockets. By opening circuits with a single client, the attacker will cause the victim’s number of open connections to reach the ConnLimit, cause relay stability problems, or both.
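A back-of-the-envelope model shows why PCTCP changes the economics of the attack. The function below counts the sockets a victim holds toward its neighbors under each design, assuming (our simplification) that vanilla Tor uses one connection per neighboring relay, Torchestra two, and PCTCP one per circuit; `rounds_to_exhaust` then estimates how many attack rounds fill a descriptor limit. All names and the neighbor mapping are hypothetical.

```python
import math


def victim_sockets(circuits_per_neighbor, design):
    """circuits_per_neighbor: neighbor relay -> circuits it carries to the victim."""
    active = [n for n in circuits_per_neighbor.values() if n > 0]
    if design == "vanilla":
        return len(active)          # one TLS connection per relay pair
    if design == "torchestra":
        return 2 * len(active)      # one light + one heavy connection per pair
    if design == "pctcp":
        return sum(active)          # one TCP connection per circuit
    raise ValueError(f"unknown design: {design}")


def rounds_to_exhaust(limit, already_open, circuits_per_round):
    """Attack rounds needed before the victim's socket count reaches limit."""
    return max(0, math.ceil((limit - already_open) / circuits_per_round))
```

With 1000 circuits arriving through each of two exits, the victim holds 2 sockets under vanilla Tor but 2000 under PCTCP; and at 1000 new PCTCP circuits per round, five rounds fill a default 4096-descriptor limit from zero, ignoring sockets the relay already holds.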

3.3 Effects of Socket Exhaustion

Socket exhaustion attacks may reduce relay availability and client anonymity if there is a descriptor limit in place, or may harm relay stability if there is no limit or the limit is too high. We now explore these effects.

Limited Sockets: If there is a limit in place, then opening sockets will consume the shared descriptor resource. An adversary that can consume all sockets on a relay will have effectively made that relay unresponsive to connections by other honest nodes due to Tor’s ConnLimit mechanism. If the adversary can persistently maintain this state over time, then it has effectively disabled the relay by preventing it from making new connections to other Tor nodes.

We ran a socket consumption attack against both vanilla Tor and PCTCP using Shadow and the network model described in Section 5.1. Our attacker node created 1000 circuits every 6 seconds through a victim relay, starting at time 1800 and ending at time 3600. Figure 2b shows the victim relay’s throughput over time as new circuits are built and victim sockets are consumed. After all available sockets are consumed, the victim relay’s throughput drops close to 0 as old circuits are destroyed, effectively disabling the relay. This, in turn, will move honest clients’ traffic away from the relay and onto other relays in the network. If the adversary is running relays, then it has increased the probability that its relays will be chosen by clients and has therefore improved its ability to perform end-to-end traffic correlation [16]. After the attacker stops the attack at time 3600, the victim relay’s throughput recovers as clients are again able to successfully create circuits through it.

Unlimited Sockets: One potential solution to the availability and anonymity concerns caused by a file descriptor limit is to remove the limit (i.e., set an unlimited limit), meaning that ConnLimit gets set to $2^{32}$ or $2^{64}$. At the lesser of the two values, and assuming that an adversary can consume one socket for every 512-byte Tor cell it sends to a victim, it would take around 2 TiB of network traffic to cause the victim to reach its ConnLimit. However, even if the adversary cannot cause the relay to reach its ConnLimit, opening and maintaining sockets will still drain the relay’s physical resources and will increase the processing time associated with socket-based operations in the kernel. By removing the open descriptor limit, a relay becomes vulnerable to performance degradation, increased memory consumption, and an increased risk of being killed by the kernel or otherwise crashing. An adversary may cause these effects through the same attacks it uses against a relay with a default or custom descriptor limit.
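The 2 TiB figure follows from multiplying the smaller limit by the cell size, under the same one-socket-per-cell assumption and using 1 TiB = 2^40 B and 1 ZiB = 2^70 B:

```latex
2^{32}\ \text{sockets} \times 512\ \tfrac{\text{B}}{\text{socket}}
  = 2^{32} \times 2^{9}\ \text{B} = 2^{41}\ \text{B} = 2\ \text{TiB},
\qquad
2^{64} \times 2^{9}\ \text{B} = 2^{73}\ \text{B} = 8\ \text{ZiB}.
```

Reaching either limit is therefore impractical; the realistic threat from an unlimited configuration is the resource drain described above, not hitting ConnLimit.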

Figure 2c shows how memory consumption increases with the number of open sockets as a process opens over a million sockets. We demonstrate other performance effects using a private Tor network of 5 machines in our lab. Our network consisted of 4 relays in total (including one directory authority), each running on a different machine. We configured each relay to run Tor v0.2.5.2-alpha, modified to use a simplified version of PCTCP that creates a new OR connection for every new circuit. We then launched a simple file server, a Tor client, and 5 file clients on the same machine as the directory authority. The file clients download arbitrary data from the server through a specified path of the non-directory relays, always using the same relay in each of the entry, middle, and exit positions. The final machine ran our Tor attacker client, which we configured to accept localhost connections over the ControlPort. We then used a custom Python script to repeatedly: (1) request that 1000 new circuits be created by the Tor client, and (2) pause for 6 seconds. Each relay tracked socket and bandwidth statistics; we use throughput and the time to open new sockets to measure performance degradation effects and relay instability.

Figure 3: (a) Throughput over time; (b) linear regression correlating throughput to the number of open sockets; (c) kernel time to open new sockets over time; (d) linear regression correlating kernel time to open a new socket to the number of open sockets.

The stability effects for the middle relay are shown in Figure 3. The attack ran for just over 2500 seconds and caused the middle relay to successfully open more than 50 thousand sockets. We noticed that our relays were unable to create more sockets due to port allocation problems, meaning that (1) we were unable to measure the potentially more serious performance degradation effects that may occur when the socket count exceeds 65 thousand, and (2) unlimited sockets may not be practically attainable due to port exhaustion between a pair of relays. Figure 3a shows throughput over time, and Figure 3b shows a negative correlation between throughput and the number of open sockets; both figures show a drop of more than 750 KiB/s in the 60-second moving average throughput during our experiment. Processing overhead during socket system calls over time is shown in Figure 3c, and its correlation with the number of open sockets is shown in Figure 3d; both figures clearly indicate that increases in kernel processing time can be expected as the number of open sockets increases. Although the absolute time to open sockets is relatively small, it more than doubled during our experiment; we believe this is a strong indication of performance degradation in the kernel and that increased processing delays in other kernel socket-processing functions are likely as well.

4. IMUX

This section explores a new algorithm that takes advantage of multiple connections while respecting the ConnLimit imposed by Tor and preventing the attacks discussed in Section 3. Both Torchestra and PCTCP can be seen as heuristically derived instances of a more general resource allocation scheme with two components: one determining how many connections to open between relays, and the other selecting a connection on which to schedule cells. Torchestra’s heuristic is to fix the number of connections at two, designating one for light traffic and the other for heavy traffic, and then to schedule cells based on the traffic classification of each circuit. PCTCP keeps a connection open for each circuit between two relays, with each connection devoted to the single circuit that schedules cells on it.
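The two-component view can be written down as a tiny interface: `num_connections` decides how many connections a channel holds, and `pick_connection` maps a circuit to one of them. The class names and the `circuits` mapping (circuit id to a "light"/"heavy" label) are hypothetical, and the label stands in for whatever classifier Torchestra actually uses.

```python
class VanillaPolicy:
    """One connection per relay pair; every circuit uses it."""

    def num_connections(self, circuits):
        return 1

    def pick_connection(self, circ_id, circuits):
        return 0


class TorchestraPolicy:
    """Two connections: index 0 for light circuits, index 1 for heavy ones."""

    def num_connections(self, circuits):
        return 2

    def pick_connection(self, circ_id, circuits):
        return 1 if circuits[circ_id] == "heavy" else 0


class PctcpPolicy:
    """One connection per circuit; each circuit owns its own index."""

    def num_connections(self, circuits):
        return len(circuits)

    def pick_connection(self, circ_id, circuits):
        return sorted(circuits).index(circ_id)
```

Viewed this way, IMUX keeps the same interface but answers `num_connections` adaptively and subject to a cap, instead of with a constant or with one connection per circuit.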

While there is a demonstrable advantage to being able to open multiple connections between two communicating relays, it is important to have an upper limit on the number of connections allowed in order to prevent anonymous socket exhaustion attacks against relays, as shown in Section 3. In this section we introduce IMUX, a new heuristic for handling multiple connections between relays that dynamically manages open connections while taking into consideration the internal connection limit in Tor.

4.1 Connection Management

Similar to PCTCP, we want to ensure that the number of connections allocated to each channel is proportional to the number of active circuits the channel is carrying, adjusting dynamically as circuits are opened and closed across all channels on the relay. PCTCP accomplishes this easily by opening or closing a dedicated connection every time a circuit is opened or closed, but since IMUX enforces an upper limit on connections, connection management requires more care, especially since the two ends of a channel may have different upper limits.

We first need a protocol dictating how and when relays can open and close connections. During the entire time the channel is open, only one relay is allowed to open connections; initially this is the relay that created the channel. However, at any time either relay may close a connection if it detects that the number of open sockets is approaching the total connection limit. When a relay decides to


Algorithm 1 Function to determine the maximum number of connections that can be open on a channel.

1: function GETMAXCONNS(nconns, ncircs)
2:   totalCircs ← len(globalActiveCircList)
3:   if ncircs is 0 or totalCircs is 0 then
4:     return 1
5:   end if
6:   frac ← ncircs / totalCircs
7:   totalMaxConns ← ConnLimit · τ
8:   connsLeft ← totalMaxConns − n_open_sockets()
9:   maxconns ← frac · totalMaxConns
10:  maxconns ← MIN(maxconns, nconns · 2)
11:  maxconns ← MIN(maxconns, nconns + connsLeft)
12:  return maxconns
13: end function


close a connection, it must first decide which connection should be closed. To pick which connection to close, the algorithm uses three criteria for prioritizing available connections: (1) always pick connections that have not fully opened yet; (2) among connections in state OPENING, pick the one that was created most recently; and (3) if all connections are open, pick the one that was least recently used. To close a connection C, the relay must first make sure the relay on the other end of the channel is aware that C is being closed, to prevent data from being written to C during the closing process. Once C is chosen, the initiating relay sends an empty cell with a new command, CLOSING_CONN_BEGIN, and marks C for close to prevent any more data from being written to it. When the responding relay on the other end of the channel receives the cell, it flushes any data remaining in the buffer for C, sends a CLOSING_CONN_END cell back to the initiating relay, and closes the connection. Once the initiating relay receives the CLOSING_CONN_END cell, it knows that it has received all data on C and can proceed to close the socket.
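The three selection criteria and the two-cell close exchange described above can be sketched as follows. The `Conn` fields, the in-memory `Relay` model, and the `(command, conn_id)` message tuples are hypothetical simplifications; only the ordering rules and the CLOSING_CONN_BEGIN/CLOSING_CONN_END exchange come from the text, and real cells, buffers, and sockets are not modeled.

```python
from dataclasses import dataclass, field

OPENING, OPEN = "OPENING", "OPEN"


@dataclass
class Conn:
    cid: int
    state: str
    created_at: float
    last_used: float
    outbuf: list = field(default_factory=list)  # cells still queued on this connection
    marked_for_close: bool = False


def pick_conn_to_close(conns):
    """Not-yet-open connections first (newest OPENING wins), else least recently used."""
    opening = [c for c in conns if c.state == OPENING]
    if opening:
        return max(opening, key=lambda c: c.created_at)
    return min(conns, key=lambda c: c.last_used)


@dataclass
class Relay:
    name: str
    conns: dict  # cid -> Conn
    delivered: list = field(default_factory=list)  # cells received from the peer

    def begin_close(self, cid):
        """Initiator: stop writing to the connection and announce the close."""
        self.conns[cid].marked_for_close = True
        return ("CLOSING_CONN_BEGIN", cid)

    def handle(self, msg, peer):
        cmd, cid = msg
        if cmd == "CLOSING_CONN_BEGIN":
            # Responder: flush what is left for this connection, confirm, then close.
            peer.delivered.extend(self.conns[cid].outbuf)
            self.conns[cid].outbuf.clear()
            del self.conns[cid]
            return ("CLOSING_CONN_END", cid)
        if cmd == "CLOSING_CONN_END":
            # Initiator: every cell sent on cid has arrived, so the socket can close.
            del self.conns[cid]
        return None


a = Relay("A", {1: Conn(1, OPEN, 0.0, 5.0), 2: Conn(2, OPEN, 1.0, 3.0)})
b = Relay("B", {1: Conn(1, OPEN, 0.0, 5.0), 2: Conn(2, OPEN, 1.0, 3.0, outbuf=["cell-x"])})
victim = pick_conn_to_close(list(a.conns.values()))  # conn 2: least recently used
reply = b.handle(a.begin_close(victim.cid), peer=a)
a.handle(reply, peer=b)
print(victim.cid, a.delivered, sorted(a.conns), sorted(b.conns))  # 2 ['cell-x'] [1] [1]
```

The initiator only tears down its socket after CLOSING_CONN_END, which is what guarantees the leftover cell is not lost.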

Figure 4: The three different connection scheduling algorithms used in IMUX.

Once a channel has been established, a housekeeping function is called on the channel every second to determine whether to open or close any connections. The function that determines the maximum number of connections that can be open on a channel is shown in Algorithm 1. We calculate a soft upper limit on the total number of open connections allowed on the relay by multiplying ConnLimit by the parameter τ ∈ (0, 1). ConnLimit is an internal variable that determines the maximum number of sockets allowed to be open on the relay. On Linux-based relays it is set by calling getrlimit() to get the file limit on the machine; otherwise it is fixed at 15,000. The parameter τ is a threshold value between 0 and 1 that sets a soft upper limit on the number of open connections. Since all connect() calls will fail once the number of open connections exceeds ConnLimit, we want some breathing room so that new channels and connections can still be opened, temporarily going past the soft limit, until other connections can be closed to bring the relay back under the limit.

To calculate the limit for a channel, we take this soft limit on the total of all open connections and multiply it by the fraction of active circuits using the channel. This gives us an upper bound on the connection limit for the channel. We then take the minimum of this upper bound and twice the number of connections currently open on the channel. This is done to prevent rapid connection opening when a channel is first created, particularly when the other relay has a much lower connection limit. Finally, we take the minimum of that result and the number of connections open on the channel plus the number of connections that can still be opened on the relay before hitting the soft limit (a quantity that can be negative, signaling that connections need to be closed). Otherwise the channel could create too many connections, driving the number of open connections past ConnLimit.
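Algorithm 1 transcribes almost line for line into Python. In this sketch the relay-wide state (`conn_limit`, `tau`, the global active-circuit count, and the number of open sockets) is passed in explicitly rather than read from Tor's internals, and the result is floored to an integer; both choices are assumptions made for illustration.

```python
import math


def get_max_conns(nconns: int, ncircs: int, total_circs: int,
                  n_open_sockets: int, conn_limit: int, tau: float) -> int:
    """Per-channel connection cap following Algorithm 1."""
    if ncircs == 0 or total_circs == 0:
        return 1
    frac = ncircs / total_circs                      # channel's share of active circuits
    total_max_conns = conn_limit * tau               # relay-wide soft limit
    conns_left = total_max_conns - n_open_sockets    # negative once past the soft limit
    maxconns = frac * total_max_conns
    maxconns = min(maxconns, nconns * 2)             # at most double per round
    maxconns = min(maxconns, nconns + conns_left)    # never overshoot the soft limit
    return math.floor(maxconns)


# A channel carrying 100 of 1,000 active circuits, ConnLimit = 15,000, tau = 0.9.
print(get_max_conns(nconns=4, ncircs=100, total_circs=1000,
                    n_open_sockets=2000, conn_limit=15000, tau=0.9))   # 8
print(get_max_conns(nconns=1000, ncircs=100, total_circs=1000,
                    n_open_sockets=2000, conn_limit=15000, tau=0.9))   # 1350
print(get_max_conns(nconns=40, ncircs=100, total_circs=1000,
                    n_open_sockets=13520, conn_limit=15000, tau=0.9))  # 20
```

The three calls exercise each cap in turn: the doubling rule (8), the channel's proportional share of the 13,500-connection soft limit (1,350), and the relay-wide headroom, which here is 20 connections below the soft limit's excess and so forces the channel down from 40 to 20.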

The housekeeping function is called every second and determines whether any connections need to be opened or closed. If we have too few connections and, based on the protocol discussed above, the relay is allowed to open connections on this channel, enough connections are created to match the maximum number allowed. If we have too many connections open, we simply close enough of them to get back down to the limit. We use the previously discussed protocol for selecting and closing connections, prioritizing newly created connections in an attempt to close unneeded connections before the TLS handshake is started, avoiding unnecessary overhead. In addition to the housekeeping function, whenever a channel accepts an incoming connection, it also checks whether the number of connections exceeds the maximum returned by Algorithm 1. If so, it notifies the relay at the other end of the channel that it is closing the connection, which, as dictated by the protocol, prevents that relay from opening any more connections.
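One housekeeping round can be sketched as below, assuming the per-channel cap has already been computed as in Algorithm 1. The `Channel` fields, the `may_open` flag standing in for the single-opener rule, the "keep at least one connection" floor, and the simplified victim choice (newest still-opening connection, else the oldest entry rather than true LRU) are hypothetical simplifications, not Tor code.

```python
from dataclasses import dataclass, field


@dataclass
class Channel:
    may_open: bool                              # only one side may open connections
    conns: list = field(default_factory=list)   # (conn_id, is_open) pairs, oldest first
    next_id: int = 0


def housekeeping(chan: Channel, max_conns: int) -> tuple[list, list]:
    """One per-second pass; returns (opened_ids, closed_ids)."""
    opened, closed = [], []
    if len(chan.conns) < max_conns and chan.may_open:
        for _ in range(max_conns - len(chan.conns)):
            chan.conns.append((chan.next_id, False))   # new connection starts OPENING
            opened.append(chan.next_id)
            chan.next_id += 1
    # Assumption: never close the channel's last connection.
    while len(chan.conns) > max(max_conns, 1):
        # Prefer still-opening connections (newest first) so they can be dropped
        # before the TLS handshake; otherwise fall back to the oldest entry.
        pending = [c for c in chan.conns if not c[1]]
        victim = pending[-1] if pending else chan.conns[0]
        chan.conns.remove(victim)
        closed.append(victim[0])
    return opened, closed


ch = Channel(may_open=True, conns=[(0, True), (1, True)], next_id=2)
print(housekeeping(ch, 4))  # ([2, 3], [])
ch.conns = [(0, True), (1, True), (2, False), (3, False)]
print(housekeeping(ch, 2))  # ([], [3, 2])
```

A relay that is not the designated opener never grows the channel, which is how the protocol lets the side with the lower limit stop further connection attempts.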

4.2  Connection  Scheduler 

For connection scheduling, PCTCP assigns each circuit a dedicated connection on which to schedule cells. Torchestra schedules cells from a circuit on either the light or the heavy connection, depending on how much data is being sent through the circuit. A circuit starts out on the light connection, and if at some point its EWMA value crosses a threshold, it is switched to the heavy connection. To accommodate this, Torchestra introduces a switching protocol so that the downstream relay can be notified when a circuit has switched and on which connection it can expect to receive cells.

While scheduling cells from a circuit on a single connection makes in-order delivery easier by relying on TCP, with multiple connections per channel this is not necessary, and in fact keeping this constraint may not be optimal. Similar to the uTLS implementation in Tor [19] and Conflux [2], we embed an 8-byte sequence number in the relay header of all cells. This allows the channel to schedule cells across multiple connections, with the cells reordered at the other end of the channel. With sequence numbers in place and the capability to schedule cells from a circuit across multiple connections, we can evaluate different scheduling algorithms that attempt to increase throughput or “fairness” (for example, where low-traffic circuits get better performance than high-traffic ones). We briefly cover the different algorithms and heuristics we can use.
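The reordering that sequence numbers enable can be sketched as a small receive-side buffer. For illustration only, an 8-byte big-endian sequence number is simply prepended to a payload (Tor's actual relay-header layout is not reproduced), and cells arriving out of order from different connections are released strictly in sequence.

```python
import heapq
import struct

SEQ = struct.Struct(">Q")  # 8-byte unsigned sequence number


def tag(seq: int, payload: bytes) -> bytes:
    """Sender side: prepend the sequence number to a cell payload."""
    return SEQ.pack(seq) + payload


class Reorderer:
    """Receiver side: hold cells until the next expected sequence number arrives."""

    def __init__(self) -> None:
        self.next_seq = 0
        self.pending: list[tuple[int, bytes]] = []

    def receive(self, cell: bytes) -> list[bytes]:
        (seq,) = SEQ.unpack_from(cell)
        heapq.heappush(self.pending, (seq, cell[SEQ.size:]))
        ready = []
        while self.pending and self.pending[0][0] == self.next_seq:
            ready.append(heapq.heappop(self.pending)[1])
            self.next_seq += 1
        return ready


# Cells 0..3 of one circuit, striped over two connections and interleaved on arrival.
cells = [tag(i, f"cell{i}".encode()) for i in range(4)]
r = Reorderer()
out = []
for cell in (cells[1], cells[3], cells[0], cells[2]):
    out.extend(r.receive(cell))
print(out)  # [b'cell0', b'cell1', b'cell2', b'cell3']
```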
Circuit Round Robin: The first scheduler, shown in Figure 4a, emulates PCTCP by assigning each circuit a single connection on which to transmit its cells. When circuits are added to a channel, the scheduler iterates round-robin style through the circuit list, assigning circuits to successive connections. If there are more circuits than connections, some circuits will share a single connection. When a connection is closed, any circuit assigned to it is given a new one to schedule through, with the remaining connections iterated in the same round-robin style as before.
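A minimal sketch of circuit round robin, including reassignment when a connection closes. The class and method names are hypothetical, and continuing the rotation from its current position when re-dealing displaced circuits is one plausible reading of “the same round-robin style as before”.

```python
class RoundRobinScheduler:
    """Circuit round robin: each circuit is pinned to one connection, dealt in rotation."""

    def __init__(self, conns):
        self.conns = list(conns)   # assumes the channel always keeps >= 1 connection
        self._cursor = 0
        self.assignment = {}       # circuit -> connection

    def _next_conn(self):
        conn = self.conns[self._cursor % len(self.conns)]
        self._cursor += 1
        return conn

    def add_circuit(self, circ):
        self.assignment[circ] = self._next_conn()

    def close_connection(self, conn):
        """Drop a connection and re-deal its circuits over the remaining ones."""
        self.conns.remove(conn)
        for circ, assigned in list(self.assignment.items()):
            if assigned == conn:
                self.assignment[circ] = self._next_conn()


s = RoundRobinScheduler(["connA", "connB", "connC"])
for circ in ["c1", "c2", "c3", "c4"]:
    s.add_circuit(circ)
print(s.assignment)  # {'c1': 'connA', 'c2': 'connB', 'c3': 'connC', 'c4': 'connA'}
s.close_connection("connA")
print(s.assignment)  # {'c1': 'connB', 'c2': 'connB', 'c3': 'connC', 'c4': 'connC'}
```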

EWMA Mapping: Tor internally uses a circuit scheduling algorithm proposed by Tang and Goldberg [21] that uses an exponentially weighted moving average (EWMA) to compute how “noisy” circuits are, and then schedules them from quietest to loudest when choosing which circuits to flush. Using the same algorithm, we compute the EWMA value for each connection as well as for the circuits. Then, as seen in Figure 4b, the circuits and connections are ordered from lowest to highest EWMA value, and we attempt to map each circuit to a connection with a similar EWMA value. More specifically, after sorting the circuits, we take the rank $r_i$ of circuit $i$, with $1 \le r_i \le n_{circs}$, and compute its percentile $p_i = r_i / n_{circs}$. We do the same with the connections, denoting their percentiles $c_j$ (so $c_j = j / n_{conns}$ for the connection of rank $j$, with $c_0 = 0$). Then, to determine which connection to map a circuit to, we pick the connection $j$ such that $c_{j-1} < p_i \le c_j$.
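The percentile mapping can be sketched directly from these definitions. The EWMA values are taken as given inputs (computing them as Tor's circuit scheduler does is out of scope), and ties are broken by input order; both are simplifying assumptions.

```python
import bisect


def ewma_map(circ_ewma: dict[str, float], conn_ewma: dict[str, float]) -> dict[str, str]:
    """Map each circuit to the connection whose percentile bracket contains it."""
    circs = sorted(circ_ewma, key=circ_ewma.get)   # quietest first; rank r_i = index + 1
    conns = sorted(conn_ewma, key=conn_ewma.get)
    n_circs, n_conns = len(circs), len(conns)
    # Upper bracket edges c_1..c_n, with c_j = j / n_conns.
    edges = [j / n_conns for j in range(1, n_conns + 1)]
    mapping = {}
    for rank, circ in enumerate(circs, start=1):
        p = rank / n_circs
        # bisect_left finds the first edge >= p, i.e. the j with c_{j-1} < p <= c_j.
        mapping[circ] = conns[bisect.bisect_left(edges, p)]
    return mapping


print(ewma_map({"c1": 5.0, "c2": 50.0, "c3": 1.0, "c4": 200.0},
               {"A": 10.0, "B": 300.0}))
# {'c3': 'A', 'c1': 'A', 'c2': 'B', 'c4': 'B'}
```

The two quietest circuits land on the quieter connection and the two loudest on the busier one, which is the “fairness” behavior the scheduler is built around.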

Shortest Queue: While the EWMA mapping scheduler is built around the idea of “fairness”, penalizing high-usage circuits by scheduling them on busier connections, we can construct an algorithm aimed at increasing overall throughput by always scheduling cells opportunistically. The shortest queue scheduler, shown in Figure 4c, calculates the queue length of each connection and schedules cells on the connection with the shortest queue. The queue length is computed by taking the length of the internal output buffer that Tor keeps for each connection and adding the amount of data in the socket's kernel TCP send buffer, which is obtained by calling ioctl()¹ with the socket descriptor and TIOCOUTQ.


¹http://man7.org/linux/man-pages/man2/ioctl.2.html
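A sketch of the shortest-queue choice in Python. `kernel_unsent_bytes` uses the real `fcntl.ioctl` call with `termios.TIOCOUTQ` (Linux only; for TCP sockets this reports the data still in the send queue, see tcp(7)). Tor's per-connection output buffer length is modeled as a caller-supplied function, which is an assumption, and the example uses illustrative byte counts instead of live sockets.

```python
import fcntl
import socket
import struct
import termios


def kernel_unsent_bytes(sock: socket.socket) -> int:
    """Bytes in the socket's kernel send queue, via ioctl(fd, TIOCOUTQ) (Linux)."""
    raw = fcntl.ioctl(sock.fileno(), termios.TIOCOUTQ, struct.pack("i", 0))
    return struct.unpack("i", raw)[0]


def queue_length(conn, tor_outbuf_len, kernel_len=kernel_unsent_bytes) -> int:
    """Tor's own output buffer for the connection plus the kernel's send queue."""
    return tor_outbuf_len(conn) + kernel_len(conn)


def pick_shortest_queue(conns, tor_outbuf_len, kernel_len=kernel_unsent_bytes):
    """Schedule the next cells on the connection with the smallest total backlog."""
    return min(conns, key=lambda c: queue_length(c, tor_outbuf_len, kernel_len))


# Illustrative numbers: conn2 has 512 + 1024 = 1536 bytes queued, the least of the three.
tor_buf = {"conn1": 4096, "conn2": 512, "conn3": 0}
kern_buf = {"conn1": 0, "conn2": 1024, "conn3": 8192}
print(pick_shortest_queue(list(tor_buf), tor_buf.get, kern_buf.get))  # conn2
```

Note that conn3 has an empty Tor buffer but a full kernel buffer, so counting only Tor's own queue would have picked the wrong connection.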


4.3  KIST:  Kernel-Informed  Socket  Transport 

Recent research by Jansen et al. [10] showed that by minimizing the amount of data buffered inside the kernel and instead keeping it local to Tor, better scheduling decisions can be made and connections can be written opportunistically, increasing performance. The two main components of the algorithm are global scheduling and autotuning. In vanilla Tor, libevent iterates through the connections in a round-robin fashion, notifying Tor that a connection can be written. When Tor receives this notification, it performs circuit scheduling only on circuits associated with that connection. Global scheduling instead takes a list of all connections that can be written and schedules among the circuits associated with every one of them, making circuit prioritization more effective. Once a circuit is chosen to write to a connection, autotuning determines how much data should be flushed to the output buffer and on to the kernel. Using a variety of socket and TCP statistics, it attempts to write just enough to keep data in the socket buffer at all times without flushing everything, giving Tor more control over scheduling decisions.
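The two KIST components can be illustrated with a deliberately simplified sketch. The write-limit formula below (congestion window times MSS, minus bytes already queued in the kernel) is an assumption chosen to convey the idea of autotuning, not the exact computation from [10], and the `Connection`/`Circuit` types are hypothetical. Global scheduling is shown as picking the lowest-EWMA circuit across all writable connections rather than within a single connection.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Connection:
    name: str
    cwnd: int            # TCP congestion window, in segments
    mss: int             # maximum segment size, bytes
    kernel_queued: int   # bytes already sitting in the kernel send queue
    writable: bool = True


@dataclass
class Circuit:
    name: str
    ewma: float
    queued: int          # bytes waiting inside Tor for this circuit
    conn: Connection


def write_limit(conn: Connection) -> int:
    """Simplified autotuning: top the kernel up only to what cwnd can send soon."""
    return max(0, conn.cwnd * conn.mss - conn.kernel_queued)


def global_schedule(circuits: list[Circuit]) -> Optional[tuple[str, int]]:
    """Pick the quietest circuit across all writable connections and a write size."""
    ready = [c for c in circuits
             if c.queued > 0 and c.conn.writable and write_limit(c.conn) > 0]
    if not ready:
        return None
    circ = min(ready, key=lambda c: c.ewma)
    return circ.name, min(circ.queued, write_limit(circ.conn))


lossy = Connection("A", cwnd=4, mss=1448, kernel_queued=2000)
clear = Connection("B", cwnd=20, mss=1448, kernel_queued=0)
circs = [Circuit("web", ewma=2.0, queued=10_000, conn=lossy),
         Circuit("bulk", ewma=90.0, queued=500_000, conn=clear)]
print(global_schedule(circs))  # ('web', 3792)
```

Because the limit scales with `cwnd`, halving the window after a loss (for example `cwnd=2`) shrinks the next write to 896 bytes, which is the behavior the evaluation later relies on.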

KIST can be used alongside Torchestra, PCTCP, and IMUX, and also as a connection scheduler within the IMUX algorithm. After KIST selects a circuit and decides how much data to flush, Torchestra and PCTCP make their own determination of which connection to write to, based on their internal heuristics. IMUX can instead take into account the connection selected by the KIST algorithm and use it to schedule cells from the circuit. As with the other connection schedulers, this means that circuits can be scheduled across different connections in a more opportunistic fashion.

5.  EVALUATION 

In  this  section  we  discuss  our  experimental  setup,  the  details 
of  our  implementations  of  Torchestra  and  PCTCP  (for  comparison 
with  IMUX),  evaluate  how  the  dynamic  connection  management  in 
IMUX  is  able  to  protect  against  potential  denial  of  service  attacks 
by  limiting  the  number  of  open  connections,  and  finally  compare 
performance  across  the  multiple  connection  schedulers,  along  with 
both  Torchestra  and  PCTCP. 

5.1  Experimental  Setup 

We perform experiments in Shadow [1, 11], a discrete event network simulator capable of running real Tor code in a simulated network. Shadow allows us to create large-scale network deployments that can be run locally and privately on a single machine, avoiding the privacy risks associated with running on the public network that many active users rely on for anonymity. Because Shadow runs the Tor software, we are able to implement our performance enhancements as patches to Tor.² We also expect that Tor running in Shadow will exhibit realistic application-level performance effects, including those studied in this paper. Finally, Shadow is deterministic; therefore our results may be independently reproduced by other researchers. Shadow also enables us to isolate performance effects and attribute them to a specific set of configurations, such as variations in scheduling algorithms or parameters. This isolation means that our performance comparisons are meaningful independent of our ability to precisely model the complex behaviors of the public Tor network.

We initialize a Tor network topology and node configuration and use it as a common Shadow deployment base for all experiments in this section. For this common base, we use the large Tor configuration that is distributed with Shadow.³ The techniques for producing this model are discussed in detail in [9] and updated in [10]. It consists of 500 relays, 1350 web clients, 150 bulk clients, 300 perf clients, and 500 file servers. Web clients repeatedly download a 320 KiB file, pausing between 1 and 60 seconds after every download. Bulk clients continuously download 5 MiB files with no pause between downloads. The perf clients download a file every 60 seconds: 100 download a 50 KiB file, 100 download a 1 MiB file, and 100 download a 5 MiB file.

²We modified Tor v0.2.5.2-alpha.

³We use Shadow v1.9.2, the latest release as of this writing.

The Shadow perf clients are configured to mimic the behavior of the TorPerf clients that run in the public Tor network to measure Tor performance over time. Because the Shadow perf clients download files of the same sizes as the TorPerf clients, we were able to verify that the performance characteristics of our Shadow model are reasonably similar to those of the public network.

5.2  Implementations 

In the original Torchestra design discussed in [8], the algorithm uses EWMA in an attempt to classify each circuit as “light” or “heavy”. Since the EWMA value depends on many external network factors (available bandwidth, network load, congestion, etc.), the algorithm uses the average EWMA values of the light and heavy connections as benchmarks. Using separate threshold values for the light and heavy connections, when a circuit goes above or below the corresponding average multiplied by the threshold, the circuit is reclassified and swapped to the other connection. The issue with this, as noted in [3], is that web traffic tends to be bursty, causing temporary spikes in circuit EWMA values. When this occurs, it increases the circuit's chance of being misclassified and assigned to the heavy connection. Doing so in turn decreases the average EWMA of both the light and heavy connections, making it easier for circuits to exceed the light connection's threshold and harder for circuits to drop below the heavy connection's threshold, meaning that web circuits that are temporarily misclassified will find it more difficult to get reassigned to the light connection. A better approach would be to use a more sophisticated classifier such as DiffTor [3] to determine whether a circuit is carrying web or bulk traffic. For our implementation, we have Torchestra use an idealized version of DiffTor in which relays have perfect information about circuit classification. When a circuit is first created by a client, the client sends either a CELL_TRAFFIC_WEB or a CELL_TRAFFIC_BULK cell notifying each relay of the type of traffic that will be sent through the circuit. Obviously this would be unrealistic in the live Tor network, but it lets us examine Torchestra under ideal conditions.
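The feedback loop behind this misclassification problem can be reproduced with a toy version of the original reclassification rule. The threshold multipliers (`light_thresh=1.5`, `heavy_thresh=0.5`) and the exact comparison form are illustrative placeholders, not values taken from [8].

```python
from statistics import mean


def reclassify(light: dict[str, float], heavy: dict[str, float],
               light_thresh: float = 1.5, heavy_thresh: float = 0.5) -> None:
    """One pass: move circuits whose EWMA crosses their connection's scaled average."""
    light_avg = mean(light.values()) if light else 0.0
    heavy_avg = mean(heavy.values()) if heavy else 0.0
    to_heavy = [c for c, e in light.items() if e > light_avg * light_thresh]
    to_light = [c for c, e in heavy.items() if e < heavy_avg * heavy_thresh]
    for c in to_heavy:
        heavy[c] = light.pop(c)
    for c in to_light:
        light[c] = heavy.pop(c)


light = {"w1": 10.0, "w2": 10.0, "w3": 10.0, "w4": 60.0}   # w4 is a bursting web circuit
heavy = {"b1": 200.0, "b2": 200.0}
reclassify(light, heavy)
print(sorted(heavy))  # ['b1', 'b2', 'w4']
```

Before the move the cutoffs were 33.75 (light) and 100 (heavy). Afterwards they drop to 15 and roughly 76.7, so other web circuits are now pushed out more easily, while w4 must fall further before it is allowed back.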

For PCTCP there are two main components of the algorithm. The first is the dedicated connection assigned to each circuit, and the second is replacing per-connection TLS encryption with a single IPSec layer between the relays, preventing an attacker from monitoring a single TCP connection to learn information about a circuit. For our purposes we are interested in the first component, the performance gains from dedicating a connection to each circuit. IPSec has some potential to increase performance, since each connection no longer requires a TLS handshake, which adds some overhead, but there are a few obstacles noted in [5] that could hinder deployment. Furthermore, IPSec can be deployed alongside any algorithm that opens multiple connections between relays. For simplicity, our PCTCP implementation simply opens a new TLS connection for each circuit created and uses that connection exclusively for transferring the circuit's cells.

5.3  Connection  Management 

Figure 5: Performance comparison of IMUX connection schedulers

One of the main goals of the dynamic connection manager in IMUX is to avoid denial of service attacks that consume all available open sockets. To achieve this, IMUX has a soft limit that caps the total number of connections at ConnLimit · τ, where τ is a parameter between 0 and 1. If τ is set too low, potential performance gains may go unrealized, while if it is set too high we risk exceeding the hard limit ConnLimit, causing new connections to error out and leaving the relay open to denial of service attacks. During our experiments we empirically observed that τ = 0.9 was the highest value the parameter could take without risking crossing ConnLimit, particularly when circuits were being created rapidly, causing high connection churn.
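As a worked example, take the fallback value of 15,000 that ConnLimit is fixed at on non-Linux relays, together with τ = 0.9:

```latex
\[
\text{ConnLimit}\cdot\tau = 15{,}000 \times 0.9 = 13{,}500,
\qquad
\text{ConnLimit} - \text{ConnLimit}\cdot\tau = 15{,}000 - 13{,}500 = 1{,}500 .
\]
```

Roughly 1,500 sockets of headroom therefore remain for connections opened while the relay is temporarily above its soft limit; on Linux relays, where ConnLimit comes from getrlimit(), the numbers scale accordingly.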
Figure 6: Comparing the socket exhaustion attack in vanilla Tor with Torchestra, PCTCP and IMUX.

Figure 6 shows the effects of the socket exhaustion attack discussed in Section 3, with Torchestra and IMUX included. Since Torchestra simply opens two connections between each pair of relays, the attack is not able to consume all available sockets, leaving the relay unaffected. The IMUX results show an initial slight drop in throughput while many connections are being created and destroyed. However, throughput recovers to the levels achieved with vanilla Tor and Torchestra as the connection manager stabilizes.

5.4  Performance 

First we wanted to determine how the various connection schedulers covered in Section 4.2 perform compared to each other and against vanilla Tor. Figure 5 shows the results of large-scale experiments run with each of the connection schedulers operating within IMUX. Round robin clearly performs better than both the EWMA mapping and shortest queue connection schedulers, at least with respect to time to first byte and download time for web clients, shown in Figures 5a and 5b. This is not completely surprising for the shortest queue scheduler, as it indiscriminately tries to push as much data as possible, favoring higher-bandwidth traffic at the potential cost of web traffic performance. The EWMA mapping scheduler produces slightly better results for web traffic than shortest queue scheduling, but it still ends up performing worse than vanilla Tor. This is related to the issue with using EWMA for classification in Torchestra: web traffic tends to be sent in large bursts, causing the EWMA value to spike rapidly and then decrease over time. So while under the EWMA mapping scheme the first data to be sent is given high priority, as the EWMA value climbs the data gets sent to busier connections, increasing the total time to last byte as a consequence.

Figure 7 shows the download times when using IMUX with round-robin connection scheduling, compared against vanilla Tor, Torchestra, and PCTCP. While Torchestra and PCTCP perform identically to vanilla Tor, IMUX sees an increase in performance for both web and bulk downloads. Half of the web clients see an improvement of at least 12% in their download times, and 29% see more than a 20% improvement, with the biggest reduction in download time seen at the 75th percentile, dropping from 4.5 seconds to 3.3 seconds. Bulk clients see gains too, although not as large: around 10% of clients see improvements of 10-12%. Time to first byte across all clients improves slightly, as shown in Figure 7a: 26% of clients see reductions of 20-23% compared to vanilla Tor, while 11% of clients see the same level of improvement compared to Torchestra and PCTCP.

We then ran large experiments with KIST enabled, with IMUX using KIST for connection scheduling. While overall download times improved compared to the previous experiments, IMUX saw slower download times for web clients and faster downloads for bulk clients, as seen in Figure 8. 40% of clients see an increase in time to first byte, while 87% of bulk clients see their download times decrease by 10-28%. This is because one of the main advantages of using multiple connections per channel is that it prevents bulk circuits from forcing web circuits to hold off on sending data because of packet loss the bulk circuit caused. In PCTCP, for example, such loss merely causes the bulk circuit's connection to be throttled while still allowing all web circuits to send data. Since KIST forces Tor to hold on to cells for longer and write only a minimal amount to the kernel, it is able to make better scheduling decisions, preventing web traffic from unnecessarily buffering behind bulk traffic. Furthermore, KIST is able to take packet loss into consideration, since it uses the TCP congestion window in calculating how much to write to the socket. Since the congestion window is reduced when there is packet loss, KIST ends up writing a smaller amount of data whenever packet loss occurs. With KIST enabled, then, much of the protection that multiple connections would otherwise give web traffic is already provided.

Figure 7: Performance comparison of IMUX to Torchestra [8] and PCTCP [5]

(a) Time to first byte (b) Time to last byte of 320 KiB (c) Time to last byte of 5 MiB

Figure 8: Performance comparison of IMUX to Torchestra [8] and PCTCP [5], all using the KIST performance enhancements [10]

6.  DISCUSSION 

In  this  section  we  discuss  the  potential  for  an  adversary  to  game 
the  IMUX  algorithm,  along  with  the  limitations  on  the  protections 
against  the  denial  of  service  attacks  discussed  in  Section  3  and 
some  possible  ways  to  protect  against  them. 

Active Circuits: The IMUX connection manager distributes connections to channels based on the fraction of active circuits carried on each channel. An adversary could game the algorithm by artificially increasing the number of active circuits on a channel they are using, heavily shifting the distribution of connections toward that channel and increasing their throughput. The ease of the attack depends heavily on how we define an “active circuit”, which we can do in one of three ways: (1) the number of open circuits that have not been destroyed; (2) the number of circuits that have sent a minimal number of cells, measured either in raw numbers or using an EWMA with a threshold; or (3) using DiffTor, counting only circuits classified as web or bulk. No matter which definition is used, an adversary will still technically be able to game the algorithm; the major difference is the amount of bandwidth the adversary needs to expend to accomplish their task. If we just count the number of open circuits, an adversary could very easily restrict all other channels to only one connection, while the rest are dedicated to the channel they are using. Using an EWMA threshold or the DiffTor classifier requires the adversary to actually send data over the circuits, with the amount determined by the thresholds in place. So while the potential to game the algorithm will always exist, the worst an adversary can do is reduce all other IMUX channels to one connection, the same as how vanilla Tor operates.
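The effect of inflating the active-circuit count can be seen with a few lines of arithmetic. This sketch keeps only the fraction-of-circuits rule from Algorithm 1 (the doubling and headroom caps are omitted), floors every channel at one connection to reflect the worst case described above, and assumes a hypothetical relay with a 13,500-connection soft limit and ten honest channels; the numbers are illustrative, not measured.

```python
def channel_caps(circ_counts: dict[str, int], soft_limit: int) -> dict[str, int]:
    """Each channel's share of the soft limit by active circuits, never below one."""
    total = sum(circ_counts.values())
    return {ch: max(1, int(soft_limit * n / total)) for ch, n in circ_counts.items()}


honest = {f"ch{i}": 10 for i in range(10)}
print(channel_caps(honest, 13_500)["ch0"])   # 1350

# The adversary registers 200,000 cheap "active" circuits on its own channel.
gamed = {**honest, "adv": 200_000}
caps = channel_caps(gamed, 13_500)
print(caps["ch0"], caps["adv"])              # 1 13493
```

Counting merely open circuits makes this inflation nearly free; an EWMA or DiffTor-based definition forces the adversary to push real traffic through every one of those circuits.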

Defense Limitations: By taking into account the connection limit of the relay, the dynamic connection manager in IMUX is able to balance the performance gains realized by opening multiple connections against protection from the new attack surface made available by PCTCP, which leads to a low-bandwidth denial of service attack against any relay in the network. However, there still exist potential socket exhaustion attacks inherent to how Tor operates. The simplest of these only requires opening streams through a targeted exit, causing sockets to be opened to any chosen destination. Since this is a fundamental part of how an exit relay operates, there is little that can be done to directly defend against this attack, although it can potentially be made more difficult to perform. Exit relays can attempt to keep sockets short-lived and close ones that have been idle for a short period of time, particularly when close to the connection limit. They can also attempt to prioritize connections between relays over ones exiting to external servers. While preventing access altogether is undesirable, this may be the lesser of two evils, as it still allows the relay to participate in the Tor network, possibly preventing adversarial relays from being chosen. This kind of attack only affects exit relays; however, the technique utilizing the Socks5Proxy option can target any relay in the network. Since this is performed by tunneling OR connections through a circuit, the attack is in effect anonymous, meaning relays cannot simply use IP blocking to protect against it. One potential solution is to require clients to solve computationally intensive puzzles in order to create a circuit, as proposed by Barbera et al. [6]. This makes it harder for a single adversary to mass-produce circuits and thereby consume socket descriptors. Additionally, since this attack requires the client to send EXTEND cells through the exit to initiate a connection through the targeted relay, exits could simply disallow connections back into the Tor network for circuit creation. This would force an adversary to connect directly to whichever non-exit relay they were targeting, in which case IP blocking becomes a viable strategy to protect against such an attack once it is detected.


7.  CONCLUSION 

In this paper we present a new class of socket exhaustion attacks that allow an adversary to anonymously perform denial of service attacks against relays in the Tor network. We outline how PCTCP, a new transport proposal, introduces a new attack surface for this class of attacks. In response, we introduce a new protocol, IMUX, generalizing the designs of PCTCP and Torchestra, that is able to take advantage of opening multiple connections between relays while still defending against these socket exhaustion attacks. Through large-scale experiments we evaluate a series of connection schedulers operating within IMUX, examine the performance of IMUX with respect to vanilla Tor, Torchestra, and PCTCP, and investigate how all these algorithms operate with a recently proposed prototype, KIST.